BERT model
Appendix A: Generalization

The directional derivative of the loss function is closely related to the eigenspectrum of the mNTKs. For deep models, as noted by Hoffer et al. (2017), the weight distance from initialization grows over training. Combining Lemma 2 and Eq. 18, we find that as training iterations increase, the model's Rademacher complexity also grows as its weights deviate further from their initializations.

We generally follow the settings of Liu et al. (2019) to train BERT. All VGG baselines are initialized with Kaiming initialization (He et al., 2015) and trained with SGD. Network pruning (Frankle & Carbin, 2018; Sanh et al., 2020; Liu et al., 2021) applies various criteria to select which parts of the network to remove. MAT is the first work to employ the principal eigenvalue of the mNTK as the module-selection criterion.

Table 5 compares the extended MAT, the vanilla BERT model, and SNIP (Lee et al., 2018b). In our implementation, we apply SNIP in a modular manner by calculating the connection sensitivity of each module. In contrast, using the criterion of MAT, we prune 50% of the attention heads while training the remaining ones with MAT; this further accelerates computation by 56.7%. Following Turc et al. (2019), we apply the proposed MAT to BERT models of different network scales.
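The module-selection rule above (rank each module by the principal eigenvalue of its empirical mNTK, K_m = J_m J_m^T, and keep only the top-scoring modules) can be sketched as follows. The dictionary-of-Jacobians interface and the `select_modules` helper are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def principal_eigenvalue(jacobian):
    """Largest eigenvalue of the empirical mNTK K = J J^T for one module."""
    ntk = jacobian @ jacobian.T                 # (n_samples, n_samples), PSD
    return float(np.linalg.eigvalsh(ntk)[-1])   # eigvalsh sorts ascending

def select_modules(module_jacobians, k):
    """Rank modules by their mNTK principal eigenvalue and keep the top k."""
    scores = {name: principal_eigenvalue(j) for name, j in module_jacobians.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores
```

In MAT's setting the retained modules would then be trained while the rest are pruned or frozen.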
DRONE: Data-aware Low-rank Compression for Large NLP Models
The representations learned by large-scale NLP models such as BERT have been widely used in various tasks. However, the increasing size of pre-trained models also brings efficiency challenges, including inference speed and model size, when deploying models on mobile devices. Specifically, most operations in BERT are matrix multiplications, but the weight matrices involved are not low-rank, so canonical matrix decomposition cannot find an efficient approximation. In this paper, we observe that the learned representation of each layer lies in a low-dimensional space.
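A minimal sketch of that observation: even if W is full-rank, the product W X can be approximated well by first projecting onto the top singular directions of the observed activations X. This one-sided projection only conveys the intuition; DRONE itself derives data-aware low-rank factors that minimize the approximation error in closed form:

```python
import numpy as np

def data_aware_lowrank(W, X, rank):
    """Approximate y = W @ x using the low-dimensional subspace of the data X.

    W : (out, in) weight matrix -- may itself be full rank.
    X : (in, n)  column-wise activations observed on real inputs.
    """
    # Top-`rank` left singular vectors of the data span its low-dim subspace.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_r = U[:, :rank]                 # (in, rank)
    W_small = W @ U_r                 # fold the projection into the weights
    # Inference-time replacement: two skinny matmuls instead of one big one.
    return lambda x: W_small @ (U_r.T @ x)
```

When the activations truly have rank r, the approximation is exact on the data subspace while costing O(rank) instead of O(in) per output.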
Incorporating BERT into Parallel Sequence Decoding with Adapters
While large-scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task-agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, chosen to match the bidirectional and conditionally independent nature of BERT, and can be adapted to traditional autoregressive decoding easily. We conduct extensive experiments on neural machine translation tasks, where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves $36.49$/$33.57$
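Adapter modules of the kind described above are typically small bottleneck networks with a residual connection, inserted between frozen pre-trained layers. A NumPy sketch (the sizes, the tanh GELU approximation, and the zero-initialized up-projection are assumptions, not the paper's exact design):

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: down-project, GELU, up-project, residual add.
    Only these few parameters are tuned; surrounding BERT layers stay frozen."""
    def __init__(self, hidden=768, bottleneck=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, 0.02, (hidden, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Zero-init the up-projection so the adapter starts as the identity map
        # and training departs smoothly from the pre-trained behaviour.
        self.W_up = np.zeros((bottleneck, hidden))
        self.b_up = np.zeros(hidden)

    @staticmethod
    def gelu(x):
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    def __call__(self, h):
        return h + self.gelu(h @ self.W_down + self.b_down) @ self.W_up + self.b_up
```

The zero-initialization means an untrained adapter passes hidden states through unchanged, which is why it can be dropped between pre-trained layers without disturbing them.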
DynaBERT: Dynamic BERT with Adaptive Width and Depth
Pre-trained language models like BERT, though powerful in many natural language processing tasks, are expensive in both computation and memory. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually compress the large BERT model to a fixed smaller size and cannot fully satisfy the requirements of different edge devices with varying hardware capabilities. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust the size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance to BERT-base (or RoBERTa-base), while at smaller widths and depths it consistently outperforms existing BERT compression methods.
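The network-rewiring idea, keeping the more important attention heads shared by more sub-networks, can be sketched by sorting heads by an importance score so that every narrower sub-network is a prefix of the widest one. A hedged sketch; the importance scores themselves (in DynaBERT, derived from gradients/losses) are taken as given:

```python
import numpy as np

def rewire_heads(head_params, importance):
    """Sort attention heads by importance so that any narrower sub-network
    (a prefix of the head list) automatically keeps the most important ones."""
    order = np.argsort(importance)[::-1]
    return head_params[order]

def sub_network(head_params, width_ratio):
    """Select the sub-network at a given width: the first ceil(ratio * H) heads."""
    h = int(np.ceil(width_ratio * len(head_params)))
    return head_params[:h]
```

Because every width shares the same prefix, knowledge distilled into the full model benefits all smaller widths simultaneously.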
BERTO: an Adaptive BERT-based Network Time Series Predictor with Operator Preferences in Natural Language
Shankar, Nitin Priyadarshini, Singh, Vaibhav, Kalyani, Sheetal, Maciocco, Christian
Abstract--We introduce BERTO, a BERT-based framework for traffic prediction and energy optimization in cellular networks. Built on transformer architectures, BERTO delivers high prediction accuracy, while its Balancing Loss Function and prompt-based customization allow operators to adjust the trade-off between power savings and performance. Natural-language prompts guide the model to manage underprediction and overprediction in accordance with the operator's intent. Experiments on real-world datasets show that BERTO improves upon existing models with a 4.13% reduction in MSE while adding the ability to balance the competing objectives of power saving and performance through simple natural-language inputs, operating over a flexible range of 1.4 kW in power and up to 9 variation in service quality, making it well suited for intelligent RAN deployments. Time series data is ubiquitous across all layers of modern communication networks.
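The Balancing Loss Function is described only at a high level above. One plausible form is an asymmetric squared error whose weighting knob an operator's natural-language preference would set; the paper's exact loss may differ, so this is purely illustrative:

```python
import numpy as np

def balancing_loss(pred, target, alpha=0.5):
    """Asymmetric squared error for traffic prediction.

    alpha weights underprediction (risk of degraded service quality);
    1 - alpha weights overprediction (wasted power). alpha is the knob a
    parsed operator prompt would set. Illustrative form only."""
    err = target - pred
    under = np.maximum(err, 0.0)    # pred < target: underprediction
    over = np.maximum(-err, 0.0)    # pred > target: overprediction
    return np.mean(alpha * under**2 + (1 - alpha) * over**2)
```

With alpha near 1 the model is pushed to over-provision (protecting service quality); with alpha near 0 it is pushed to under-provision (maximizing power savings).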
Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages
Kubík, Jozef, Šuppa, Marek, Takáč, Martin
Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies, which we term 'Active Learning schedulers', to boost the fine-tuning process with a limited amount of training data. We connect AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance the model's performance. Experiments on the Slovak, Maltese, Icelandic, and Turkish languages show that using clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings of up to 30% and performance improvements of up to four F1 points, while also providing better fine-tuning stability.
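One plausible realization of an 'AL scheduler' combining clustering with uncertainty sampling: cycle over clusters and take the most uncertain unlabeled example from each, so the annotation batch stays both informative and diverse. The function shape and the round-robin strategy are assumptions, not the paper's exact pipeline:

```python
import numpy as np

def al_scheduled_batch(uncertainty, cluster_ids, labeled, batch_size):
    """One Active Learning round: cycle over clusters (the 'scheduler') and
    pick the most uncertain unlabeled example from each cluster in turn."""
    picks = []
    clusters = np.unique(cluster_ids)
    ci = 0
    while len(picks) < batch_size:
        c = clusters[ci % len(clusters)]
        mask = (cluster_ids == c) & ~labeled & ~np.isin(np.arange(len(labeled)), picks)
        if mask.any():
            cand = np.where(mask)[0]
            picks.append(int(cand[np.argmax(uncertainty[cand])]))
        ci += 1
        if ci > len(clusters) * batch_size:   # all clusters exhausted
            break
    return picks
```

The selected indices would then be sent for annotation and added to the fine-tuning set before the next round.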
A Hybrid Classical-Quantum Fine Tuned BERT for Text Classification
Masum, Abu Kaisar Mohammad, Mahmud, Naveed, Najafi, M. Hassan, Aygun, Sercan
Fine-tuning BERT for text classification can be computationally challenging and requires careful hyper-parameter tuning. Recent studies have highlighted the potential of quantum algorithms to outperform conventional methods in machine learning and text classification tasks. In this work, we propose a hybrid approach that integrates an n-qubit quantum circuit with a classical BERT model for text classification. We evaluate the performance of the fine-tuned classical-quantum BERT and demonstrate its feasibility as well as its potential in advancing this research area. Our experimental results show that the proposed hybrid model achieves performance that is competitive with, and in some cases better than, the classical baselines on standard benchmark datasets. Furthermore, our approach demonstrates the adaptability of classical-quantum models for fine-tuning pre-trained models across diverse datasets. Overall, the hybrid model highlights the promise of quantum computing in achieving improved performance for text classification tasks.
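As a self-contained illustration of the quantum half of such a hybrid model, the sketch below simulates an n-qubit circuit directly as a NumPy statevector: classical features (e.g. a slice of a pooled BERT embedding) are angle-encoded into RY rotations, entangled with a CNOT chain, and read out as per-qubit Z expectations that a classical head could consume. The RY-plus-CNOT ansatz is a generic assumption, not the circuit used in the paper:

```python
import numpy as np

def ry(theta):
    """Single-qubit RY rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, q, n):
    """Apply a 2x2 gate to qubit q (most-significant-bit first) of an n-qubit state."""
    state = state.reshape(2**q, 2, 2**(n - q - 1))
    return np.einsum('ab,ibj->iaj', gate, state).reshape(-1)

def apply_cnot(state, ctrl, tgt, n):
    """Flip the target qubit on the amplitudes where the control qubit is |1>."""
    s = state.reshape([2] * n).copy()
    idx1 = [slice(None)] * n
    idx1[ctrl] = 1
    sub = s[tuple(idx1)]
    s[tuple(idx1)] = np.flip(sub, axis=tgt if tgt < ctrl else tgt - 1).copy()
    return s.reshape(-1)

def quantum_features(x, n):
    """Angle-encode n classical features, entangle, return <Z_i> per qubit."""
    state = np.zeros(2**n)
    state[0] = 1.0
    for q in range(n):
        state = apply_1q(state, ry(x[q]), q, n)
    for q in range(n - 1):
        state = apply_cnot(state, q, q + 1, n)
    probs = np.abs(state) ** 2
    z = []
    for q in range(n):
        bits = (np.arange(2**n) >> (n - q - 1)) & 1
        z.append(float(np.sum(probs * (1 - 2 * bits))))
    return np.array(z)
```

In a hybrid classifier, these expectation values would replace (or augment) the pooled BERT vector fed to the final classification layer.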
SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model
Silva, Emanuel C., Salum, Emily S. M., Arantes, Gabriel M., Pereira, Matheus P., Oliveira, Vinicius F., Bicho, Alessandro L.
Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervised-trained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.
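The parameter-balancing step ("balances their parameters ... to ensure competitive integrity") could be as simple as normalizing a spell's attributes to a fixed power budget after the BERT model proposes raw values. A hypothetical sketch; the attribute names and the budget value are made up:

```python
def balance_spell(params, budget=100.0):
    """Scale a spell's raw attribute scores so their total 'power' never
    exceeds a fixed budget. Attributes and budget are illustrative."""
    total = sum(params.values())
    if total <= budget:
        return dict(params)
    scale = budget / total
    return {k: v * scale for k, v in params.items()}
```

An over-ambitious prompt ("maximum damage, huge radius, long burn") would thus still yield a tournament-legal spell, just with each attribute scaled down proportionally.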
MIMIC-\RNum{4}-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp for Risk Prediction
Wang, Jing, Niu, Xing, Zhang, Tong, Shen, Jie, Kim, Juyong, Weiss, Jeremy C.
A crucial component of developing a reliable clinical risk prediction model is collecting high-quality time-series clinical events. In this work, we release such a dataset, consisting of 22,588,586 clinical time series events, which we term MIMIC-\RNum{4}-Ext-22MCTS. Our source data are discharge summaries selected from the well-known yet unstructured MIMIC-IV-Note \cite{Johnson2023-pg}. The general-purpose nature of MIMIC-IV-Note poses specific challenges for our work: the discharge summaries are too lengthy for typical natural language models to process, and the clinical events of interest are often not accompanied by explicit timestamps. Therefore, we propose a new framework that works as follows: 1) we break each discharge summary into manageably small text chunks; 2) we apply contextual BM25 and contextual semantic search to retrieve chunks that have a high potential of containing clinical events; and 3) we carefully design prompts to teach the recently released Llama-3.1-8B \cite{touvron2023llama} model to identify or infer temporal information of the chunks. The obtained dataset is informative and transparent: standard models fine-tuned on it achieve significant improvements in healthcare applications. In particular, a BERT model fine-tuned on our dataset achieves a 10\% accuracy improvement on a medical question answering task and a 3\% improvement on a clinical trial matching task compared with the classic BERT. The dataset is available at https://physionet.org/content/mimic-iv-ext-22mcts/1.0.0. The codebase is released at https://github.com/JingWang-RU/MIMIC-IV-Ext-22MCTS-Temporal-Clinical-Time-Series-Dataset.
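Step 2 of the pipeline (retrieving chunks likely to contain clinical events) can be illustrated with plain BM25 over the chunked summary; this minimal version omits the "contextual" enrichment and the companion semantic search:

```python
import math
from collections import Counter

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each text chunk against a query with standard BM25, a minimal
    stand-in for the contextual-BM25 retrieval step of the pipeline."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Chunks with the highest scores (here, those mentioning the queried medication) would then be passed to the LLM prompting stage for timestamp extraction.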